
Introduction

Are we alone in the Universe? This is one of the most profound questions that humankind has sought to answer since the beginning of recorded history. The modern search for exoplanets offers some insight into this mystery. The underlying purpose of contemporary exoplanet programs is to discover habitable planets, especially around nearby stars, and to find evidence of life elsewhere in the cosmos. Using Earth and its lifeforms as a template to assess habitability, we seek other celestial bodies with conditions similar to our own, i.e., with Earth-like surface gravities and temperatures at which liquid water could exist. To meet these criteria, a habitable world must have a mass and a distance from its host star that fall within particular ranges. We are interested in analyzing the data for confirmed exoplanets to assess our progress in finding a planet with physical characteristics similar to Earth's.

For our investigation, we used data from the NASA Exoplanet Archive (https://exoplanetarchive.ipac.caltech.edu). This archive is an online compilation, collation, and cross-correlation of astronomical data and information on exoplanets and their host stars. The data are vetted by a team of astronomers at the California Institute of Technology (Caltech) under contract with the National Aeronautics and Space Administration (NASA) as part of the Exoplanet Exploration Program. An extensive overview of the archive's data, services, and tools is given by Akeson et al. (2013, PASP, 125, 989), published in the Publications of the Astronomical Society of the Pacific (PASP).

We downloaded our dataset from the NASA Exoplanet Archive on 2018 July 5. The dataset consists of 354 columns and 3,748 rows of information on confirmed exoplanets and their host stars, as well as information about their discovery. Discovery information includes the method used to detect the exoplanet, the locale of the observatories used for detection (i.e., whether ground-based observations, space-based observations, or a mixture of both were used), and the year of discovery. Physical characteristics of exoplanets in the archive include planetary mass, orbital period, and orbital semi-major axis. Physical properties of host stars listed in the archive include stellar mass, stellar radius, effective temperature, surface gravity, spectral type, luminosity, and distance from Earth. These physical properties of both exoplanets and their host stars are important for determining how similar an exoplanet is to Earth and for assessing how readily such a planet can be detected.

We reduced the original dataset from the NASA Exoplanet Archive to the 15 columns most directly relevant to habitable planets and removed space characters from the values. These data are stored in the file planets.csv. The variables we are interested in as responses are pl_bmasse, pl_orbsmax, and st_dist. They are, respectively, the planet mass in Earth masses (\(M_\oplus = 5.9722 \times 10^{24}\,\rm{kg}\)), the planet orbital semi-major axis in astronomical units (\(\rm{AU} = 1.495978707 \times 10^{11}\,\rm{m}\)), and the distance from our solar system to the exoplanet system in parsecs (\(\rm{pc} = 3.0857 \times 10^{16}\,\rm{m}\)). These response variables are key physical parameters that determine the habitability of a planet and the potential feasibility of reaching it.
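The trimming step described above could look roughly like the following sketch. The helper name trim_planets and the toy data frame are hypothetical and are not taken from our actual preprocessing script. The toy values are illustrative only.

```r
# Hypothetical helper: keep only the requested columns and strip space
# characters from text values (e.g. "Radial Velocity" -> "RadialVelocity").
trim_planets <- function(raw, keep_cols) {
  trimmed <- raw[, keep_cols, drop = FALSE]
  for (nm in names(trimmed)) {
    col <- trimmed[[nm]]
    if (is.character(col) || is.factor(col)) {
      trimmed[[nm]] <- factor(gsub(" ", "", as.character(col), fixed = TRUE))
    }
  }
  trimmed
}

# Toy stand-in for the raw archive table (illustrative values only).
raw_demo <- data.frame(
  pl_discmethod = c("Radial Velocity", "Transit"),
  pl_bmasse     = c(6166, 1.5),
  extra_col     = c("a", "b"),
  stringsAsFactors = FALSE
)

demo <- trim_planets(raw_demo, c("pl_discmethod", "pl_bmasse"))
levels(demo$pl_discmethod)
```

The result could then be written out with write.csv(demo, "planets.csv", row.names = FALSE).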

Methods

Data preprocessing

To obtain a consistent dataset with no missing values in any of the responses or predictors under consideration, we removed every row containing an empty value, which left 594 rows.

planets <- read.csv("planets.csv")
planets_good <- planets[complete.cases(planets), ]
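One caveat is that complete.cases() only flags NA values. By default, read.csv() reads an empty field in a text column as the empty string "" rather than NA. The sketch below uses a small made-up CSV (not our data) to show the difference and how na.strings changes it. Whether planets.csv contains such empty strings is not something this report establishes.

```r
# Toy CSV text standing in for planets.csv (illustrative values only).
csv_text <- "pl_locale,st_dist,pl_bmasse
Ground,110.6,6166
,119.5,4685
Space,,1526"

# Default: the empty text field becomes "", the empty numeric field becomes NA.
demo_default <- read.csv(text = csv_text)

# Treat empty strings as missing too.
demo_strict <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Number of missing values per column under the stricter reading.
colSums(is.na(demo_strict))

# Rows that survive complete.cases() under each reading.
n_default <- nrow(demo_default[complete.cases(demo_default), ])
n_strict  <- nrow(demo_strict[complete.cases(demo_strict), ])
c(default = n_default, strict = n_strict)
```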

The structure of the R object on which we perform our analysis is as follows.

str(planets_good)
## 'data.frame':    594 obs. of  15 variables:
##  $ pl_discmethod: Factor w/ 10 levels "Astrometry","EclipseTimingVariations",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ pl_disc      : int  2007 2009 2008 2002 1996 2018 2010 2010 2009 2008 ...
##  $ pl_locale    : Factor w/ 3 levels "Ground","MultipleLocales",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ st_dist      : num  110.6 119.5 76.4 18.1 21.4 ...
##  $ st_optmag    : num  4.74 5.02 5.23 6.61 6.25 ...
##  $ st_teff      : num  4742 4213 4813 5338 5750 ...
##  $ st_mass      : num  2.7 2.78 2.2 0.9 1.08 0.99 1.54 1.54 1.93 0.98 ...
##  $ st_rad       : num  19 29.79 11 0.93 1.13 ...
##  $ st_logg      : num  2.31 1.93 2.63 4.45 4.36 2.42 3.5 3.5 4.43 1.71 ...
##  $ st_metfe     : num  -0.35 -0.02 -0.24 0.41 0.06 -0.77 -0.03 -0.03 0.19 -0.46 ...
##  $ st_vsini     : num  1.2 1.5 2.6 1.6 2.18 3.36 2.77 2.77 38.3 1 ...
##  $ pl_orbper    : num  326 516 186 1773 798 ...
##  $ pl_orbsmax   : num  1.29 1.53 0.83 2.93 1.66 ...
##  $ pl_orbeccen  : num  0.231 0.08 0 0.37 0.68 0.042 0.09 0.29 0.29 0.38 ...
##  $ pl_bmasse    : num  6166 4685 1526 1481 566 ...

There are two factor variables and 13 numeric variables. The following table briefly describes the variables.

| Variable name | Variable description |
|---|---|
| pl_discmethod | Method of discovery (RadialVelocity, Transit, TransitTimingVariations, etc.) |
| pl_disc | Year of discovery |
| pl_locale | Locale of discovery (Ground, MultipleLocales, or Space) |
| st_dist | Stellar distance (parsecs) |
| st_optmag | Stellar apparent optical magnitude (mag) |
| st_teff | Stellar effective temperature (Kelvin) |
| st_mass | Stellar mass (solar masses) |
| st_rad | Stellar radius (solar radii) |
| st_logg | Stellar surface gravity (log10 of g in cm/s^2) |
| st_metfe | Stellar metallicity ([Fe/H] or [M/H], dex) |
| st_vsini | Stellar projected rotational velocity (km/s) |
| pl_orbper | Planet orbital period (days) |
| pl_orbsmax | Planet orbital semi-major axis (AU) |
| pl_orbeccen | Planet orbital eccentricity |
| pl_bmasse | Planet mass (Earth masses) |

Some of these variables may be collinear with one another. We can inspect for collinearity visually using a matrix of scatterplots of the variables.
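A scatterplot matrix of this kind can be drawn with base R's pairs(). The sketch below is one possible way to do it, not necessarily the exact call used for our figure. The helper name plot_numeric_pairs is hypothetical, and the demo data frame contains only the first five st_mass, st_rad, and st_logg values shown in the str() output.

```r
# Hypothetical helper: draw a scatterplot matrix of the numeric columns and
# return their pairwise correlations invisibly.
plot_numeric_pairs <- function(data) {
  numeric_cols <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]
  pairs(numeric_cols, pch = 20, cex = 0.5, col = "dodgerblue")
  invisible(cor(numeric_cols))
}

# First five rows of three stellar columns, copied from the str() output.
stars_demo <- data.frame(
  st_mass = c(2.7, 2.78, 2.2, 0.9, 1.08),
  st_rad  = c(19, 29.79, 11, 0.93, 1.13),
  st_logg = c(2.31, 1.93, 2.63, 4.45, 4.36)
)

cors <- plot_numeric_pairs(stars_demo)
round(cors, 2)
```

For the full dataset the call would be plot_numeric_pairs(planets_good). The two factor columns are dropped automatically.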

From the matrix of scatterplots, it appears that planet orbital period (pl_orbper) and orbital semi-major axis (pl_orbsmax) are collinear. This collinearity is expected, since orbital period and semi-major axis are physically related through Kepler's third law of planetary motion. Stellar radius (st_rad) and stellar surface gravity (st_logg) also appear to be collinear, which is likewise expected because surface gravity depends on stellar mass and radius. Collinearity makes estimating model coefficients and interpreting models more difficult, but it generally does not degrade predictions made within the range of the observed data. We will keep this in mind as we search for potential models for our data.
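The two physical relations behind these collinearities can be checked with a short calculation. The sketch below assumes Kepler's third law in solar-system units with the planet's mass neglected, \(P^2 = a^3 / M_\star\) with \(P\) in years, \(a\) in AU, and \(M_\star\) in solar masses. It also assumes the scaling \(\log g = \log g_\odot + \log_{10} M_\star - 2 \log_{10} R_\star\) with \(\log g_\odot \approx 4.44\) in cgs units. The function names are hypothetical. For the first row shown in the str() output, both relations land close to the tabulated pl_orbper and st_logg. Agreement for other rows is not guaranteed, because archive parameters can come from different sources.

```r
# Kepler's third law, planet mass neglected: period in days from the
# semi-major axis (AU) and stellar mass (solar masses).
kepler_period_days <- function(a_au, m_star_msun) {
  365.25 * sqrt(a_au^3 / m_star_msun)
}

# Surface gravity (log10 of cm/s^2) from stellar mass and radius in solar
# units, assuming log g_sun is approximately 4.44.
logg_from_mass_radius <- function(m_star_msun, r_star_rsun) {
  4.44 + log10(m_star_msun) - 2 * log10(r_star_rsun)
}

# First row of the str() output: a = 1.29 AU, M = 2.7 Msun, R = 19 Rsun.
period_row1 <- kepler_period_days(a_au = 1.29, m_star_msun = 2.7)
logg_row1   <- logg_from_mass_radius(m_star_msun = 2.7, r_star_rsun = 19)
c(period_days = period_row1, logg = logg_row1)
```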

Results

Discussion

Appendix

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
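# Breusch-Pagan test decision at level alpha (H0: constant error variance)
get_bp_decision = function(model, alpha) {
  decide = unname(bptest(model)$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

# Shapiro-Wilk test decision at level alpha (H0: residuals are normal)
get_sw_decision = function(model, alpha) {
  decide = unname(shapiro.test(resid(model))$p.value < alpha)
  ifelse(decide, "Reject", "Fail to Reject")
}

# Number of estimated coefficients, including the intercept
get_num_params = function(model) {
  length(coef(model))
}

# Leave-one-out cross-validated RMSE via the hat-value shortcut
get_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

# Adjusted R-squared
get_adj_r2 = function(model) {
  summary(model)$adj.r.squared
}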
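For an ordinary least-squares fit, the leave-one-out RMSE shortcut used in get_loocv_rmse avoids refitting the model n times. The sketch below compares it with an explicit refit-and-predict loop. It uses the built-in mtcars data as a stand-in, because the planets data are not bundled here. The function names loocv_rmse_shortcut and loocv_rmse_refit are hypothetical.

```r
# Same formula as get_loocv_rmse above, repeated so the example is standalone.
loocv_rmse_shortcut <- function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model)))^2))
}

# Brute force: refit the model n times, each time leaving one row out and
# predicting the held-out response.
loocv_rmse_refit <- function(formula, data) {
  errors <- vapply(seq_len(nrow(data)), function(i) {
    held_out <- data[i, , drop = FALSE]
    fit_i <- lm(formula, data = data[-i, , drop = FALSE])
    y_i <- model.response(model.frame(formula, data = held_out))
    unname(y_i - predict(fit_i, newdata = held_out))
  }, numeric(1))
  sqrt(mean(errors^2))
}

# Stand-in model on a built-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)
shortcut <- loocv_rmse_shortcut(fit)
refit    <- loocv_rmse_refit(mpg ~ wt + hp, mtcars)
all.equal(shortcut, refit)
```

The two values agree up to floating-point error, so the shortcut is safe for linear models fit with lm().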